GPUs are ubiquitous in modern computers. The table below lists NVIDIA GPUs found in today’s typical computer systems.

| NVIDIA GPUs         | H100 PCIe            | RTX 6000           | RTX 5000    |
|---------------------|----------------------|--------------------|-------------|
| Computers           | servers, cluster     | desktop            | laptop      |
| Main usage          | scientific computing | daily work, gaming | daily work  |
| Memory              | 80 GB                | 48 GB              | 16 GB       |
| Memory bandwidth    | 2 TB/sec             | 960 GB/sec         | 576 GB/sec  |
| Number of cores     | ???                  | ???                | ???         |
| Processor clock     | ??? GHz              | ??? GHz            | ??? GHz     |
| Peak DP performance | 26 TFLOPS            | ??? TFLOPS         | ??? TFLOPS  |
| Peak SP performance | 51 TFLOPS            | 91.1 TFLOPS        | 42.6 TFLOPS |
2 GPU architecture vs CPU architecture
GPUs contain 1000s of processing cores on a single card; several cards can fit in a desktop PC
Each core carries out the same operations in parallel on different input data – single program, multiple data (SPMD) paradigm
Extremely high arithmetic intensity if one can transfer the data onto and results off of the processors quickly
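The SPMD idea maps directly onto Julia’s broadcasting syntax: one scalar kernel, applied in parallel across all elements of the data. A minimal CPU sketch (the kernel name and array values here are illustrative; on a GPU array type the identical dot syntax launches a device kernel):

```julia
# one "program": a scalar kernel
saxpy(a, x, y) = a * x + y

a = 2.0f0
x = Float32[1, 2, 3, 4]
y = Float32[10, 20, 30, 40]

# broadcasting applies the same kernel to multiple data elements
z = saxpy.(a, x, y)    # Float32[12, 24, 36, 48]
```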
3 GPGPU in Julia
GPU support by Julia is under active development. Check JuliaGPU for currently available packages.
There are multiple paradigms for programming GPUs in Julia, depending on the specific hardware.
CUDA is an ecosystem exclusively for NVIDIA GPUs. There are extensive CUDA libraries for scientific computing: cuBLAS, cuRAND, cuSPARSE, cuSOLVER, cuDNN, …
The CUDA.jl package allows defining arrays on Nvidia GPUs and overloads many common operations.
The AMDGPU.jl package allows defining arrays on AMD GPUs and overloads many common operations.
The Metal.jl package allows defining arrays on Apple Silicon GPUs and overloads many common operations.
AppleAccelerate.jl wraps the macOS Accelerate framework, which provides high-performance libraries for linear algebra, signal processing, and image processing on Apple Silicon CPUs. It is the analog of MKL for Intel CPUs.
The oneAPI.jl package allows defining arrays on Intel GPUs and overloads many common operations.
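All of these packages share one array-programming model: code written against AbstractArray runs on whichever backend the array lives on (Array, CuArray, ROCArray, MtlArray, oneArray). A hedged sketch with a plain CPU Array; the commented line shows the GPU variant, which is not executed here:

```julia
# generic column norms: nothing here is CPU- or GPU-specific
colnorms(A::AbstractMatrix) = sqrt.(vec(sum(abs2, A; dims = 1)))

A = Float32[3 0; 4 0; 0 5]
cn = colnorms(A)          # Float32[5.0, 5.0] on the CPU
# colnorms(MtlArray(A))   # identical code on an Apple GPU array
```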
I’ll illustrate using Metal.jl on my MacBook Pro running macOS Sequoia 15.4. It has an Apple M2 Max chip with 38 GPU cores.
versioninfo()
Julia Version 1.11.5
Commit 760b2e5b739 (2025-04-14 06:53 UTC)
Build Info:
Official https://julialang.org/ release
Platform Info:
OS: macOS (arm64-apple-darwin24.0.0)
CPU: 12 × Apple M2 Max
WORD_SIZE: 64
LLVM: libLLVM-16.0.6 (ORCJIT, apple-m2)
Threads: 8 default, 0 interactive, 4 GC (on 8 virtual cores)
Environment:
JULIA_NUM_THREADS = 8
JULIA_EDITOR = code
using AppleAccelerate, BenchmarkTools, LinearAlgebra, Metal, Random

Random.seed!(257)

n = 2^14

# on CPU
x = rand(Float32, n, n)
y = rand(Float32, n, n)
z = zeros(Float32, n, n)

# on GPU
xd = MtlArray(x)
yd = MtlArray(y)
zd = MtlArray(z);
6.1 Dot product
# SP matrix dot product on CPU: tr(X'Y)
bm_cpu = @benchmark dot($x, $y)
BenchmarkTools.Trial: 138 samples with 1 evaluation per sample.
Range (min … max): 34.377 ms … 56.968 ms ┊ GC (min … max): 0.00% … 0.00%
Time (median): 34.923 ms ┊ GC (median): 0.00%
Time (mean ± σ): 36.388 ms ± 3.159 ms ┊ GC (mean ± σ): 0.00% ± 0.00%
▆██▄▃ ▁ ▁
███████▅▇▁▅▅▁█▇▇▁▅▇▅▇▇▇▁▅█▁▁▁▇▁▅▅▁▁▅▁▁▁▅▁▅▁▅▁▅▁▁▁▅▁▁▅▁▇▁▅▁▅ ▅
34.4 ms Histogram: log(frequency) by time 45.5 ms <
Memory estimate: 0 bytes, allocs estimate: 0.
# SP matrix dot product on CPU using Apple Accelerate: tr(X'Y)
bm_acc = @benchmark AppleAccelerate.dot($x, $y)
BenchmarkTools.Trial: 143 samples with 1 evaluation per sample.
Range (min … max): 34.379 ms … 40.565 ms ┊ GC (min … max): 0.00% … 0.00%
Time (median): 35.034 ms ┊ GC (median): 0.00%
Time (mean ± σ): 35.193 ms ± 982.134 μs ┊ GC (mean ± σ): 0.00% ± 0.00%
█▂ ▂▅▂
▃████▆█▇███▅▅▄▃▁▃▁▁▁▁▁▁▃▃▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▃▁▁▁▁▁▁▁▃▃▁▁▁▃▁▁▃ ▃
34.4 ms Histogram: frequency by time 39.9 ms <
Memory estimate: 0 bytes, allocs estimate: 0.
# SP matrix dot product on GPU: tr(X'Y)
# why are there allocations?
bm_gpu = @benchmark Metal.@sync dot($xd, $yd)
BenchmarkTools.Trial: 662 samples with 1 evaluation per sample.
Range (min … max): 7.447 ms … 8.989 ms ┊ GC (min … max): 0.00% … 0.00%
Time (median): 7.539 ms ┊ GC (median): 0.00%
Time (mean ± σ): 7.556 ms ± 93.986 μs ┊ GC (mean ± σ): 0.00% ± 0.00%
▁ ▃█▃▅▁▁▁▃▃ ▁
▃▄▄▄▇▇█▆███████████▆▆▇█▆▇▇▅█▇▇▆▄▄▄▄▂▄▂▃▂▄▂▃▃▂▂▃▃▂▁▃▁▂▂▁▃▂▂ ▄
7.45 ms Histogram: frequency by time 7.78 ms <
Memory estimate: 21.24 KiB, allocs estimate: 837.
# speedup by Apple Accelerate
median(bm_acc.times) / median(bm_cpu.times)
1.0031748038858854
# speedup on GPU over CPU
median(bm_cpu.times) / median(bm_gpu.times)
4.632175281234197
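Median-time ratios, as above, are one reasonable summary; minimum times are another common choice, since they are least affected by system noise. A self-contained sketch with made-up timing vectors (in nanoseconds, standing in for `bm_cpu.times` and `bm_gpu.times`; the numbers are illustrative, not measured):

```julia
using Statistics

# synthetic timing samples in ns (illustrative only)
cpu_times = [34.9e6, 35.1e6, 36.0e6, 45.5e6]
gpu_times = [7.5e6, 7.5e6, 7.6e6, 7.8e6]

speedup_median = median(cpu_times) / median(gpu_times)
speedup_min    = minimum(cpu_times) / minimum(gpu_times)
```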
6.2 Broadcast
# SP broadcast on CPU: z .= x .* y
bm_cpu = @benchmark $z .= $x .* $y
BenchmarkTools.Trial: 142 samples with 1 evaluation per sample.
Range (min … max): 34.871 ms … 36.395 ms ┊ GC (min … max): 0.00% … 0.00%
Time (median): 35.260 ms ┊ GC (median): 0.00%
Time (mean ± σ): 35.330 ms ± 307.233 μs ┊ GC (mean ± σ): 0.00% ± 0.00%
▄ ▄ ▁██▁▆▄▃▁▄▃▁ ▃ ▁ ▄
▆█▆█▇▁▆▇▇███████████▇▄▆█▇▇▇▁█▄▆▁▄▄█▄▇▆▄▄▄▄▇▄▄▄▁▁▁▁▁▁▁▁▁▁▁▁▄▄ ▄
34.9 ms Histogram: frequency by time 36.3 ms <
Memory estimate: 0 bytes, allocs estimate: 0.
# SP broadcast on GPU: z .= x .* y
# why is there allocation?
bm_gpu = @benchmark Metal.@sync $zd .= $xd .* $yd
BenchmarkTools.Trial: 564 samples with 1 evaluation per sample.
Range (min … max): 8.755 ms … 9.588 ms ┊ GC (min … max): 0.00% … 0.00%
Time (median): 8.855 ms ┊ GC (median): 0.00%
Time (mean ± σ): 8.867 ms ± 76.948 μs ┊ GC (mean ± σ): 0.00% ± 0.00%
▁ ▁▂ ▃▆▆▂▁█▄▄▂▂▂▄▇ ▂
▄▃▄██▇███████████▇█████▇█▇▆▄▅▇▄▆▄▃▃▃▃▃▂▃▃▂▁▂▁▂▁▂▁▂▂▁▁▁▁▂▁▂ ▄
8.76 ms Histogram: frequency by time 9.12 ms <
Memory estimate: 4.53 KiB, allocs estimate: 177.
6.3 Cholesky
BenchmarkTools.Trial: 1 sample with 1 evaluation per sample.
Single result which took 7.649 s (0.00% GC) to evaluate,
with a memory estimate of 1.00 GiB, over 3 allocations.
BenchmarkTools.Trial: 1 sample with 1 evaluation per sample.
Single result which took 7.635 s (0.00% GC) to evaluate,
with a memory estimate of 1.00 GiB, over 3 allocations.
We don’t see a GPU speedup for Cholesky at the moment.
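For reference, the CPU side of such a Cholesky benchmark can be set up as below; this is a sketch with a small n so it runs quickly (the GPU version would apply cholesky to an MtlArray, wrapped in Metal.@sync):

```julia
using LinearAlgebra

n = 256
A = randn(Float32, n, n)
S = A' * A + n * I        # symmetric positive definite

F = cholesky(Symmetric(S))
# verify the factorization: S ≈ L * L'
F.L * F.L' ≈ S            # true
```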
7 Evaluation of elementary and special functions on GPU
7.1 Sine and log functions
# elementwise function on GPU arrays
fill!(yd, 1)
bm_gpu = @benchmark Metal.@sync $zd .= log.($yd .+ sin.($xd))
bm_gpu
BenchmarkTools.Trial: 563 samples with 1 evaluation per sample.
Range (min … max): 8.769 ms … 9.569 ms ┊ GC (min … max): 0.00% … 0.00%
Time (median): 8.871 ms ┊ GC (median): 0.00%
Time (mean ± σ): 8.879 ms ± 67.372 μs ┊ GC (mean ± σ): 0.00% ± 0.00%
▁▁▁ ▂▂▃█▄▆▁▄▂▃▆▁▃▂▄▃
▂▃▅▆▆▆███▇█████████████████▇▅▆▇▇▄▄▄▅▂▄▄▄▃▁▃▁▂▃▁▁▁▂▂▂▂▁▁▁▁▂ ▄
8.77 ms Histogram: frequency by time 9.1 ms <
Memory estimate: 4.53 KiB, allocs estimate: 177.
# elementwise function on CPU arrays
x, y, z = collect(xd), collect(yd), collect(zd)
bm_cpu = @benchmark $z .= log.($y .+ sin.($x))
bm_cpu
BenchmarkTools.Trial: 2 samples with 1 evaluation per sample.
Range (min … max): 2.750 s … 2.755 s ┊ GC (min … max): 0.00% … 0.00%
Time (median): 2.753 s ┊ GC (median): 0.00%
Time (mean ± σ): 2.753 s ± 3.053 ms ┊ GC (mean ± σ): 0.00% ± 0.00%
█ █
█▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁█ ▁
2.75 s Histogram: frequency by time 2.75 s <
Memory estimate: 0 bytes, allocs estimate: 0.
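One reason the CPU timing above is so slow is that broadcasting is single-threaded, even though this session was started with 8 threads. A hedged sketch of a manually threaded version of the same elementwise kernel (the function name is mine, and small 4×4 arrays are used here for illustration):

```julia
# threaded elementwise z .= log.(y .+ sin.(x)), splitting work over columns
function threaded_logsin!(z, x, y)
    Threads.@threads for j in axes(x, 2)
        @inbounds for i in axes(x, 1)
            z[i, j] = log(y[i, j] + sin(x[i, j]))
        end
    end
    return z
end

x = rand(Float32, 4, 4)
y = ones(Float32, 4, 4)
z = similar(x)
threaded_logsin!(z, x, y)
```

With `JULIA_NUM_THREADS = 8` as in this session, each thread handles a slice of the columns.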